IN THE UNITED STATES PATENT AND TRADEMARK OFFICE 




In re Application of:

Applicant:            Su et al.
Appl. No.:            10/091,237
Confirmation No.:     7188
Filed:                March 4, 2002
TC/A.U.:              2176
Examiner:             Stevens, R.
Attorney Docket No.:  10013661-1
Customer No.:

For:  METHOD AND SYSTEM OF DOCUMENT TRANSFORMATION BETWEEN A SOURCE
      EXTENSIBLE MARKUP LANGUAGE (XML) SCHEMA AND A TARGET XML SCHEMA



Commissioner for Patents 
P.O. Box 1450 

Alexandria, VA 22313-1450 



DECLARATION OF PRIOR INVENTION IN THE UNITED 
STATES TO OVERCOME CITED PATENT (37 C.F.R. § 1.131) 



Sir:

My name is Elke Angelika Rundensteiner. I am an inventor of the
subject matter of the above-identified Patent Application.

The declaration made hereof is to establish a reduction to
practice of the invention in this Application in the United
States, at a date prior to November 5, 2001, that is the
earliest publication date of the paper "Induction of
Integrated View for XML Data with Heterogeneous DTDs," by
Jeong et al., which was recently cited by the US Patent
Examiner in prosecution of the present Application.

10013661-1/JPW/LCH Page 1 Examiner: Stevens, R.

Serial No.: 10/091,237 Art Unit: 2176

Below stated are the activities regarding the date on 
which the invention in the present Application was reduced 
to practice. 

Reduction to Practice 

A copy of the paper entitled, "Automating the
Transformation of XML Documents," hereinafter referred to as
"the Transformation Paper," is offered as Exhibit A. The
authors of the paper include Hong Su, Harumi Kuno, and Elke
A. Rundensteiner, all inventors of the present invention.

The Transformation Paper is directed to the discovery 
of transformation operations between two XML schemas of the 
present Application. This paper demonstrates that the 
present invention was reduced to practice. 

Reduction to Practice Date 

The present invention was reduced to practice at least
as early as November 5, 2001. Specifically, the creation of
the Transformation Paper entitled, "Automating the
Transformation of XML Documents," occurred prior to November
5, 2001.



Declaration 

I, Elke Angelika Rundensteiner, hereby declare that all
statements made herein of my own knowledge are true and that
all statements made on information and belief are believed
to be true; and further that these statements were made with
the knowledge that willful false statements and the like so
made are punishable by fine or imprisonment, or both, under
Section 1001 of Title 18 of the United States Code, and that
such willful false statements may jeopardize the validity of
the application of any patent issued thereon.



Dated : 



Signature 

Elke Angelika Rundensteiner 
2 Sutton Place 
Acton, MA 01720 






Exhibit A 

COPY OF THE PAPER ENTITLED 
"AUTOMATING THE TRANSFORMATION OF XML DOCUMENTS"






Automating the Transformation of XML Documents



Hong Su 
Computer Science Dept 
Worcester Polytechnic Institute 
Worcester, MA 01609 

suhong@cs.wpi.edu 



Harumi Kuno
Hewlett-Packard Labs
Palo Alto, CA 94304

harumi_kuno@hp.com



Elke A. Rundensteiner
Computer Science Dept
Worcester Polytechnic Institute
Worcester, MA 01609

rundenst@cs.wpi.edu 



ABSTRACT 

The advent of web services that use XML-based message
exchanges has spurred many efforts that address issues related
to inter-enterprise service electronic commerce interactions.
Currently emerging standards and technologies enable
enterprises to describe and advertise their own Web Services
and to discover and determine how to interact with services
fronted by other businesses. However, these technologies
do not address the problem of how to reconcile structural
differences between similar types of documents supported
by different enterprises. Transformations between such
documents must thus be created manually on a case-by-case
basis. In this paper, we explore the problem of how to
automate the transformation of XML e-business documents.
We develop an integrated solution that automates as much
as possible all steps of the document transformation
process. One, we propose a set of schema transformation
operations that establish semantic relationships between two
XML document schemas. Two, we define a model that
allows us to compare the cost of performing these operations.
Three, we introduce an algorithm that discovers an efficient
sequence of operations for transforming a source document
schema into a target document schema based on our cost
model. The operation sequence then is used to generate an
equivalent XSLT transformation script. Experimental
results indicate that our algorithm can satisfactorily discover
acceptable transformations.

1. INTRODUCTION 
1.1 Motivation 

Web services [9] are significantly more loosely coupled than
traditional applications. Web services are deployed on the
behalf of diverse enterprises, and the programmers who
implement them are unlikely to collaborate with each other
during development. However, the purpose of web services
is to enable business-to-business interactions. Web services
should be able to discover new services and interact with
them dynamically without requiring developers to update
the code of either service. This need is spurring the
creation of electronic commerce standards such as ebXML [16]
and the Web Services Conversation Language (WSCL) [1]
that all support different aspects of inter-enterprise service
interactions.

However, these technologies do not address the significant 
problem of how to reconcile structural differences between 
similar types of documents supported by two different
enterprises. For example, let Service A and Service B be services
that front different companies. Suppose that these services
want to engage in a shopping cart interaction, and that
Service B requires Service A to submit a shipping information
document. Service A might be able to provide this
information, but in a slightly different format than Service B
expects. For example, Service A's document might list the
address first and Service B's might list it last. Similarly,
Service B might call the zip code element "Postal Code"
where Service A names it "Zip Code".

Currently, the only recourse of service developers is to
create a transformation between the two documents by hand.
Manual translation of the XML documents is time consuming
and thus especially unacceptable for web services, where
the information sources change frequently. Hence
applications must evolve quickly. Clearly, tools are needed to
automate or at least support this process as much as
possible.

1.2 State of the Art 

Schema Translation. ARTEMIS [2, 3] supports the
analysis and reconciliation of sets of heterogeneous relational
schemas by measuring the similarity of element names, data
types, and structures. Clio [11, 21] uses reasoning about
SQL queries to create initial mappings between relational
schemas, then refines these mappings using data examples.
However, because relational schemas are flat, neither Clio
nor ARTEMIS can handle hierarchical XML schemas.

TranScm [12] uses schema matching to derive an automatic
translation between schema instances. All input schemas are
transformed into a common model, namely, labeled graphs.
It offers a set of "rules" that describe how to match a
component (i.e., a node in the labeled graph) in the source schema
with a corresponding component in the target schema. The
matching is performed node by node starting at the top.
Rules are checked in a fixed order based on their priorities.
However, since TranScm is a system aiming to provide a
general approach, issues such as what special XML matching
rules should be provided to the rule base and the assignment
of priority for each rule would first need to be solved to put
the approach in the XML context.

The machine-learning approach taken in [6] attempts to 
train a learner by a set of user-provided mappings from a 
data source to the global schema and then discovers the 
characteristic instance patterns. Given a new data source, 
one-to-one mappings between the leaf nodes of two schema 
trees can be established by trying out those learned matches. 
However it does not match source-schema elements with a 
hierarchical structure, i.e., the inner nodes in the schema 
tree, as needed for XML. Furthermore, in cases where ex- 
ample data sets of both source and target XML documents 
are available, such an approach could not be applied. 

Tree Matching. Much work has been done in the area of 
tree matching. [14] and [22] address the change detection 
problem for ordered and unordered trees respectively. How- 
ever, the tree matching problem treats the label of each node 
as a second class citizen. For example, the cost of relabeling 
is assumed to be cheaper than that of deleting a node with 
the old label and inserting a node with the new label. How- 
ever if we model an XML schema as a tree, some labels of 
the nodes can be names of the XML tags which carry semantic
meaning. A relabel from one node to another semantically
unrelated node will cause an undesirable result. Thus the
assumption is invalid for the XML domain. We overcome
this limitation in our work.

LaDiff [5] adapts a simple cost model in which insert, delete 
and move are all unit cost operations, i.e., cost is 1. We 
now refine the cost model to take XML characteristics into 
account. LaDiff also assumes that each node of the input 
trees has a special label that describes its semantics (seman- 
tic tag). For example, a tree representing a document may 
have tags "paragraph", "section", etc. And for each leaf 
node in the source tree, there is at most one leaf node in 
the target tree that is "close" to it (unique close partner). 
These assumptions facilitate the matching. MH-Diff [4] al- 
lows flexible cost models and drops the assumptions in [5] 
but then takes quadratic time in the size of the input. 

There are a number of differences between the tree match- 
ing problem studied in [5, 4] and the specific problem of 
matching trees that model XML schemas. First, some of 
their edit operations such as copy and glue in [4] are not 
meaningful for an XML schema. Instead we need XML- 
schema-specific edit operations. Second, the assumption of 
unique close partner in [5] does not necessarily hold true. 
Meanwhile the assumption of semantic label holds for some 
of the nodes in the XML model. That is, some of the nodes 
do not have tags describing their semantics (e.g., constraint 
nodes which will be introduced in Section 2) while others 
do have them (e.g., tag nodes). Hence it is not possible to 
only use the assumptions to direct the mapping. Neither is 
it suitable to completely discard the assumption as [4] has 
done which will result in a high time complexity. 

XML Document Restructuring. [7] studies how to change
an XML document in terms of both schema and data. It
proposes a set of DTD change primitives that can be applied to
an old DTD to derive a new DTD; the corresponding data
changes are made implicitly as well. These primitives,
however, do not cover all the XML schema mappings that
are most likely to happen.

1.3 The Xtra Approach 

Since DTDs are currently the dominant industry standard,
we address the problem of how to transform a document
conforming to a source DTD so that it will conform to a
target DTD. Our approach could easily be adapted to XML
Schema [18]. Given a source and a target DTD, we first
model each DTD as a tree. This allows us to express the
problem as how to transform one DTD tree into another. To
this end, we have defined a set of DTD transformation
operations that establish the semantic relationships between two
trees. We also define a cost model for choosing a sequence of
transformation operations among multiple alternatives. We
have developed an algorithm to discover a sequence of
operations (i.e., transformation script) that transforms a source
DTD tree into a target DTD tree. The discovery process is
based on provided auxiliary information (e.g., synonym
dictionary, domain ontology, etc.) and a cost model we define
for choosing a transformation script among multiple
alternatives. Lastly, we use the resulting transformation script to
generate an Extensible Stylesheet Language Transformations
(XSLT) script [19]. The XSLT script can then be applied to
source XML documents to transform them into XML
documents conforming to the target DTD. Figure 1 shows the
architecture of our system.




Figure 1: Xtra System Architecture 



The primary contributions of our work include: 

1. We propose a set of DTD transformation operations
that capture common discrepancies between alternative
DTD design behaviors for modeling real-world
data.

2. We define a cost model (based on the concept of data
capacity) for measuring the quality of DTD
transformations.

3. We have implemented an XML TRAnslation prototype
system (Xtra)^1 and run experiments on both real
and synthetic XML documents to verify the feasibility
of our approach.



^1Xtra has been demonstrated at ACM SIGMOD 2001.

2. DTD DATA MODEL

Document Type Definition (DTD) [17] describes the
structure of XML documents as a list of element type
declarations. Elements can in turn have content particles,
attributes or be empty. The structure of elements is defined
via a content-model built out of operators applied to its
content particles. Content particles can be grouped as
sequences (e.g., a,b) or as choices (e.g., a|b) with both a and b
being content particles again. For every content particle, the
content-model can specify its occurrence in its parent
content particle using regular expression operators (i.e., ?,*,+).
Attributes can be of various types such as ID, CDATA, etc.
They can be optional (#IMPLIED) or mandatory
(#REQUIRED). Optionally, attributes can have a default or a
constant value (#FIXED). We model an element type
declaration as a tree, denoted as T = (N, p, l), where N is the
set of nodes, p is the parent function representing the parent
relationship between two nodes, and l is the labeling
function representing the properties of a node. We categorize a
node n ∈ N based on its label l(n).

• Tag node:

- Element node: Each element node n is associated
with an element type T. l(n) is a singleton in
the format of [Name] where Name is T's name.

- Attribute node: Each attribute node n is
associated with an attribute type T. l(n) is a quadruple
in the format of [Name, Type, Def, Val] where
Name is T's name, Type is T's data type (e.g.,
CDATA, etc.), Def is T's default property (e.g.,
#REQUIRED, #IMPLIED, etc.), and Val is T's
default or fixed value if any.

• Constraint node:

- List node: Each list node n indicates how its
children are composed, that is, by sequence (i.e.,
l(n) = [","]) or by choice (i.e., l(n) = ["|"]).

- Quantifier node: It represents whether its
children occur in its parent's content model one or
more (i.e., l(q) = ["+"], called plus quantifier
node), zero or more (i.e., l(q) = ["*"], called star
quantifier node), or zero or one times (i.e., l(q) =
["?"], called qmark quantifier node).

A tree rooted at a node of element type T is called T's type
declaration tree. We assume each DTD has a unique root
element type whose type declaration tree is then the DTD
tree. For example, Figure 2 shows two sample DTDs of
web-service purchase orders. These two DTDs will be used
as the running example through the paper. The DTDs are
modeled as DTD trees in Figures 3 and 4. For simplicity,
we mark each node with its name rather than a complete
label. In the following, we represent each node by its name
n with a subscript i indicating the number of the DTD it
is within (i.e., 1 or 2). The format is then <n>i^2. Since
each element type declaration is composed of a list of
content particles enclosed in parentheses (optionally followed
by a quantifier), we do not explicitly model the outermost
parenthesis construct as a sequence list node in the DTD
trees. For example, <name>2 has two children <first>2
and <last>2 rather than a sequence list node.

^2If there is a duplicate name, another subscript specifying
which one it is can be added, i.e., <n_j>i.
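
The data model above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Xtra implementation: the class name Node, the helper functions (element, attribute, seq, choice, quantifier) and the use of an empty string for a missing Val are all hypothetical, and the parent function p is represented implicitly by the children lists.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    """One node of a DTD tree T = (N, p, l); `label` plays the role of l(n)."""
    kind: str                 # "element", "attribute", "list" or "quantifier"
    label: Tuple[str, ...]    # singleton, or a quadruple for attribute nodes
    children: List["Node"] = field(default_factory=list)

    @property
    def is_tag(self) -> bool:
        return self.kind in ("element", "attribute")

    @property
    def is_constraint(self) -> bool:
        return self.kind in ("list", "quantifier")


def element(name: str, *children: Node) -> Node:
    return Node("element", (name,), list(children))


def attribute(name: str, type_: str, default: str, value: str = "") -> Node:
    # Label is the quadruple [Name, Type, Def, Val]; "" stands for "no value".
    return Node("attribute", (name, type_, default, value))


def seq(*children: Node) -> Node:
    return Node("list", (",",), list(children))


def choice(*children: Node) -> Node:
    return Node("list", ("|",), list(children))


def quantifier(symbol: str, child: Node) -> Node:
    if symbol not in ("+", "*", "?"):
        raise ValueError(f"unknown quantifier: {symbol}")
    return Node("quantifier", (symbol,), [child])


# DTD 1: <!ELEMENT name (family|given|middle?)*>
name_1 = element(
    "name",
    quantifier("*", choice(element("family"), element("given"),
                           quantifier("?", element("middle")))),
)

# DTD 1: <!ELEMENT person (name, email?, url?, fax+)>
# The outermost parentheses are not modeled as a sequence list node.
person_1 = element(
    "person",
    name_1,
    quantifier("?", element("email")),
    quantifier("?", element("url")),
    quantifier("+", element("fax")),
)

# DTD 1: <!ATTLIST company license CDATA #IMPLIED>
license_1 = attribute("license", "CDATA", "#IMPLIED")
```

Keeping the label as a tuple mirrors the singleton and quadruple formats in the text, so a relabel can be expressed as replacing one tuple with another.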



<!ELEMENT company (address, cname, personnel)>
<!ATTLIST company license CDATA #IMPLIED>
<!ELEMENT address (street, city, state, zip)>
<!ELEMENT cname (#PCDATA)>
<!ELEMENT personnel (person)+>
<!ELEMENT street (#PCDATA)>
<!ELEMENT city (#PCDATA)>
<!ELEMENT state (#PCDATA)>
<!ELEMENT zip (#PCDATA)>
<!ELEMENT person (name, email?, url?, fax+)>
<!ELEMENT name (family|given|middle?)*>
<!ELEMENT email (#PCDATA)>
<!ELEMENT url (#PCDATA)>
<!ELEMENT fax (#PCDATA)>
<!ELEMENT family (#PCDATA)>
<!ELEMENT given (#PCDATA)>
<!ELEMENT middle (#PCDATA)>

(a)

<!ELEMENT company (cname, (street, city, state, postal),
                   personnel, license?)>
<!ELEMENT cname (#PCDATA)>
<!ELEMENT street (#PCDATA)>
<!ELEMENT city (#PCDATA)>
<!ELEMENT state (#PCDATA)>
<!ELEMENT postal (#PCDATA)>
<!ELEMENT personnel (person)+>
<!ELEMENT license (#PCDATA)>
<!ELEMENT person (name, email*, url?, fax, fax?, phonenum)>
<!ELEMENT name (first, last)>
<!ELEMENT email (#PCDATA)>
<!ELEMENT url (#PCDATA)>
<!ELEMENT fax (#PCDATA)>
<!ELEMENT phonenum (#PCDATA)>
<!ELEMENT first (#PCDATA)>
<!ELEMENT last (#PCDATA)>

(b)

Figure 2: Example DTDs of Web Service A and B's
Purchase Orders

3. TRANSFORMATION OPERATIONS 
3.1 Taxonomy of Transformation Operations 

We identify two primary causes of discrepancies between the
DTD components modeling the same concepts. First, the
properties of the concepts may differ. For example, phone
number is required in contact information in DTD 2 while
it is not required in DTD 1. Second, due to the relatively
free-form nature of XML and lack of a standard for DTD
design, a given concept can be modeled in a variety of ways.
For example, an atomic property can be represented as
either a #PCDATA sub-element or an attribute. We have
studied the common DTD design patterns and
correspondingly proposed a set of transformation operations, as listed
below^3.

1. add(T, n): Add a subtree T under node n, i.e., add a 
new content particle to element n's content model. 

2. insert(n, p, C): Insert a new node n under node p with
n a quantifier node or a sequence list node and move a
subset of p's children C to become n's children. If n is a
quantifier node, the operation changes the occurrence
property of the children C in p's content model from
"exactly once" to correspond to n. If n is a sequence
list node, it groups the nodes C.

n cannot be an attribute node since an attribute node 
would not have any children. And we do not allow n to 
be an element node because it may cause undesirable 
matches. See Example 1. 

^3We use "child" to refer to a direct child, as opposed to "descendants".

Figure 3: DTD 1's DTD Tree




Figure 4: DTD 2's DTD Tree 

3. delete(T): Delete subtree T, i.e., delete a content
particle T from a content model. This is the reverse
operation of add.

4. remove(n): Remove node n with n a quantifier or a
sequence list node. All n's children now become p(n)'s
children. This is the reverse operation of insert.

5. relabel(n, l, l'): Change node n's original label l to l'.

• relabel within the same type (the operation does
not change the node's type):

(a) Renaming between two element nodes, two
attribute nodes or two quantifier nodes but not
between a sequence list node and a choice list
node; (b) Conversion between an attribute's
default type #REQUIRED and #IMPLIED.

• relabel across different types (the operation changes
the node's type):

(a) Conversion between a sequence list node and
an element node which has children. This
corresponds to using a group instead of an element
type or encapsulating the group into a new
element type. See Example 2. (b) Conversion
between an attribute node with type CDATA,
default type #REQUIRED, no default or fixed value
and a #PCDATA element node; (c) Conversion
between an attribute node of default property
#IMPLIED and a #PCDATA element node with
a qmark quantifier parent node. (b) and (c) are
allowed in that people can model a one-to-one
relationship property as either an attribute or a
subelement. See Example 3.

6. unfold(T, <T1, T2, ..., Tl>): Replace subtree T with a
sequence of subtrees T1, T2, ..., Tl. T must root at a
repeatable quantifier node. T1, T2, ..., and Tl satisfy
that: (1) they are adjacent siblings; and (2) they or
their subtrees without a qmark quantifier root node
are isomorphic. unfold recasts a repeatable content
particle as a sequence of non-repeatable content
particles. See Example 4.

7. fold(<T1, T2, ..., Tl>, T): This is the reverse operation
of unfold.

8. split(m1, <n1, n2>): A sequence list node m1 is split
into a star quantifier node n1 and a choice list node n2.
Because there is no DTD operator to create unordered
sequences, tuples <a, b> tend to be expressed using
the construct (a|b)* rather than (a,b)|(b,a). This
operation corresponds to converting an ordered sequence
to an unordered one. See Example 5.

9. merge(<n1, n2>, m1): n1 and n2 are merged into a
single node m1 with n1 a star quantifier node, n2 a
choice list node and m1 a sequence list node. This is
the reverse operation of split.
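
As an illustration of how one operation and its reverse act on a DTD tree, the following Python sketch implements insert and remove on a deliberately minimal tree. It is an assumption-laden sketch rather than the authors' code: the Node class, the attach and show helpers, and the use of a bare string as the label are all hypothetical simplifications.

```python
from dataclasses import dataclass, field
from typing import List, Optional

QUANTIFIERS = {"+", "*", "?"}


@dataclass(eq=False)  # identity-based equality, so list lookups find the exact node
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)
    parent: Optional["Node"] = field(default=None, repr=False)

    def attach(self, child: "Node", index: Optional[int] = None) -> "Node":
        """Make `child` a child of this node, optionally at a given position."""
        child.parent = self
        if index is None:
            self.children.append(child)
        else:
            self.children.insert(index, child)
        return child


def insert(label: str, p: Node, moved: List[Node]) -> Node:
    """insert(n, p, C): add quantifier or sequence list node n under p, move C below n."""
    if label not in QUANTIFIERS and label != ",":
        raise ValueError("n must be a quantifier node or a sequence list node")
    if not moved or any(c.parent is not p for c in moved):
        raise ValueError("C must be a non-empty subset of p's children")
    index = min(p.children.index(c) for c in moved)
    p.children = [c for c in p.children if all(c is not m for m in moved)]
    n = p.attach(Node(label), index)
    for c in moved:
        n.attach(c)
    return n


def remove(n: Node) -> None:
    """remove(n): splice n's children into p(n) in place of n (reverse of insert)."""
    p = n.parent
    if p is None:
        raise ValueError("cannot remove the root")
    i = p.children.index(n)
    p.children[i:i + 1] = n.children
    for c in n.children:
        c.parent = p
    n.children, n.parent = [], None


def show(node: Node) -> str:
    """Render a subtree as label(child child ...)."""
    if not node.children:
        return node.label
    return f"{node.label}({' '.join(show(c) for c in node.children)})"


person = Node("person")
name, email, fax = (person.attach(Node(x)) for x in ("name", "email", "fax"))

plus = insert("+", person, [fax])   # person (name, email, fax) becomes (name, email, fax+)
after_insert = show(person)
remove(plus)                        # and back again
after_remove = show(person)
```

The check that rejects element labels in insert corresponds to the restriction stated for operation 2; the other operations could be written in the same style.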

Example 1. For the DTDs shown below, it is inappropriate
to derive <name>2 from <name>1 by inserting a
tag node CEO between company and name since <name>2
indicates the company's CEO's name while <name>1
indicates the company's name. We would rather first delete
name and then insert a subtree rooted at CEO to define
DTD 2.

<!ELEMENT company (name, address, webpage)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT address (#PCDATA)>
<!ELEMENT webpage (#PCDATA)>
DTD 1

<!ELEMENT company (CEO, webpage)>
<!ELEMENT CEO (name, address)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT address (#PCDATA)>
<!ELEMENT webpage (#PCDATA)>
DTD 2

Example 2. <!ELEMENT company (street, city, state,
zip)> can be transformed from <!ELEMENT company
(address)> <!ELEMENT address (street, city, state, zip)>.

Example 3. <!ELEMENT company (license?)>
<!ELEMENT license (#PCDATA)> can be transformed
from <!ELEMENT company (EMPTY)> <!ATTLIST
company license CDATA #IMPLIED> by a relabel
operation.

Example 4. <!ELEMENT person (fax+)> can be
unfolded to <!ELEMENT person (fax, fax)> or
<!ELEMENT person (fax, fax?)>.

Example 5. <!ELEMENT name (first, last)> is split
into <!ELEMENT name (first|last)*>.

3.2 Constraints on the Transformation

While our atomic operations reflect intuitive transforma- 
tions, some combinations of operations may result in non- 
intuitive transformations. For example, for the DTDs shown 
in Example 1 in Section 3.1, we can derive DTD 2 from DTD 
1 by first inserting a sequence list node above name and 
address, and then relabeling the sequence list node to tag 
node CEO. This is equivalent to the forbidden operation of 
inserting a tag node CEO above name and address. 

Common design patterns show that an element type
declaration will not be deeply nested. A survey of real world
DTDs [13] analyzes 65 DTDs available at [20]. The depth of
content models is defined as: 0 for EMPTY; 1 for a single
element, a sequence or a choice; n for an alternation of
sequences and choices of depth n. For example, (a, (b|(c,d)))
has depth 3. It turns out that the maximum depth of an
element type's content model is almost always around 2 or 3,
with the average depth being even lower. This is because
complex regular expressions are not advisable since they are
difficult to understand. Also, a complex expression can usually
be rewritten as some simpler ones. According to this
design pattern, if a node n1 has a matching partner n2, it is
highly likely that n1 and n2 have a similar depth in the
subtrees rooted at their nearest matching ancestors in the DTD
trees. This gives a hint that if the DTDs are designed in a
common manner, the search space can be pruned to gain
time efficiency. Therefore we discover only change scripts
that do not violate the constraint that a node can be
operated on only by a non-relabel operation, optionally preceded
or followed by a relabel operation^4.
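
The depth measure quoted from the survey can be made concrete with a short Python sketch. The representation is an assumption made for illustration only: None stands for EMPTY, a bare string for a single element, and a tuple whose first item is "," or "|" for a sequence or choice group; quantifiers are ignored because the definition does not mention them.

```python
def depth(model):
    """Nesting depth of a content model under the definition given in the text."""
    if model is None:              # EMPTY
        return 0
    if isinstance(model, str):     # a single element as the whole content model
        return 1
    _operator, *items = model      # ("," or "|", item, item, ...)
    nested = [depth(item) for item in items if isinstance(item, tuple)]
    # A group counts one level, plus the deepest group nested inside it.
    return 1 + max(nested, default=0)


# (a, (b|(c,d))) from the text
example = (",", "a", ("|", "b", (",", "c", "d")))
example_depth = depth(example)
```

Under this reading a flat sequence such as (street, city, state, zip) has depth 1, which is why the grouped address in DTD 2 of the running example only reaches depth 2.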

4. COST MODEL OF OPERATIONS 

These operations can be combined into a variety of equiv- 
alent transformation scripts. In order to facilitate selec- 
tion among alternative transformations, we propose a cost 
model that evaluates the cost of transformation operations 
in terms of their impact on the data capacity of the docu- 
ment schemas. Relative information capacity [8] measures 
the semantic connection between database schemas. That 
is, two schemas are considered equivalent if and only if there 
is a one-to-one mapping between all data instances in the 
source and the target schema. We assume that the DTDs 
in our study are flat [10], i.e., no schema information such 
as the names of element or attribute types in one DTD are 
stored as PCDATA or values of attributes in an XML doc- 
ument conforming to another DTD. Hence we only consider 
PCDATA and attribute values in XML documents as data. 
The data capacity of an XML document denotes the collec- 
tion of all of its data. 

4.1 Factors of the Cost Model

Data capacity gap. We say a transformation operation is
data capacity reducing (DC-Reduce) if it must result in the
loss of data, e.g., delete a subtree. Correspondingly, we have
data capacity increasing (DC-Increase) operations, e.g., add
a subtree, and data capacity preserving (DC-Preserve)
operations, e.g., relabel an element node to a sequence list node.
However, for some operations, it is difficult to determine from
the DTDs alone whether the transformation will result in the
loss, addition, or preservation of data capacity. For
example, the operation remove quantifier node <"*"> changes
the content particle from non-required to required, which may
cause an increase in data. It also changes the content
particle from repeatable to non-repeatable, which may cause data
reduction. Hence reducing, increasing or preserving of data
capacity are all possible and depend on the individual source
XML document. We call these transformations data
capacity ambiguous (DC-Ambiguous). We use DC(op) to denote
the cost that the data capacity gap of the operation op
contributes to op's overall cost. For any two ops falling into the
same category, DC(op) is the same and 0 <= DC(op) <= 1.

^4Relabel is the only operation that does not change the
tree's topology.

Potential data capacity gap. Although some
transformations are data capacity preserving, there may still be a
potential data capacity gap between a document conforming
to the source DTD and one conforming to the target DTD.
For example, the operation insert a quantifier node <"+">
is a DC-Preserve transformation. However, it changes the
children content particles' occurrence property from
countable to non-countable and then allows the XML documents
to accommodate more data in the future. We use PDC(op)
to denote the cost that the potential data capacity gap
contributes to operation op's overall cost. Then we define
PDC(op) = w_required * required_changed(op) + w_countable *
countable_changed(op), where required_changed(op) and
countable_changed(op) are two boolean functions that
indicate whether the properties "required" or "countable" of
the content particles that are operated on by op are changed
or not. Weights w_required and w_countable indicate the
importance of the change of the corresponding property for the
potential data capacity. They satisfy w_required + w_countable = 1.
Only operations that insert, remove or relabel a quantifier
node may have a PDC cost that is not 0. For example,
suppose w_required = w_countable = 0.5; then for op of
insert a quantifier node <"+">, required_changed(op) = 0,
countable_changed(op) = 1, therefore PDC(op) = 0.5 * 0 +
0.5 * 1 = 0.5.
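
The PDC definition can be checked with a small Python sketch. The function name pdc, its keyword arguments and the INSERT_QUANTIFIER table are hypothetical; the table encodes one reading of what inserting each quantifier changes relative to "exactly once" and is an assumption beyond the single "+" case worked in the text.

```python
def pdc(required_changed: bool, countable_changed: bool,
        w_required: float = 0.5, w_countable: float = 0.5) -> float:
    """PDC(op) = w_required * required_changed(op) + w_countable * countable_changed(op)."""
    if abs(w_required + w_countable - 1.0) > 1e-9:
        raise ValueError("weights must satisfy w_required + w_countable = 1")
    return w_required * int(required_changed) + w_countable * int(countable_changed)


# Assumed effect of inserting a quantifier above content that occurred exactly once:
#   symbol -> (required_changed, countable_changed)
INSERT_QUANTIFIER = {
    "+": (False, True),   # still required, no longer countable (case from the text)
    "?": (True, False),   # becomes optional, still at most one
    "*": (True, True),    # becomes optional and repeatable
}

pdc_plus = pdc(*INSERT_QUANTIFIER["+"])
pdc_star = pdc(*INSERT_QUANTIFIER["*"])
```

With the equal weights used in the text, inserting "+" costs 0.5, and under the assumed table inserting "*" costs the full 1.0 because both properties change.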

Relative factors of operands. The number, size or
property of operands involved in an operation may impact the
data capacity or the potential data capacity gap. We use
Fac(op) to denote the factor. For example, for a content
model (fax+), a fold operation deriving it from (fax, fax)
will be more expensive than that deriving it from (fax,
fax, fax). This is because the former one causes a greater
potential data gap. For another example, when relabeling
occurs between two tag nodes, if their names are
synonyms (e.g., "zip" and "postal"), Fac(op) is 0. If no
knowledge about the relationship between the two names
is available, then Fac(op) is proportional to the dissimilarity of
their name strings. For example, Fac(op) of relabeling
between "address" and "addr" will be less than that between
"address" and "capital".
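
One way to realize the relabeling factor is sketched below in Python. The text does not say which string measure Xtra uses, so the choice of difflib.SequenceMatcher from the standard library is an assumption, as are the function name fac_relabel, the scale parameter and the representation of the synonym dictionary as a set of unordered pairs.

```python
from difflib import SequenceMatcher

# Hypothetical synonym dictionary: each entry is an unordered pair of names.
SYNONYMS = frozenset({frozenset(("zip", "postal"))})


def fac_relabel(old: str, new: str, synonyms=frozenset(), scale: float = 1.0) -> float:
    """Fac(op) for relabeling between two tag nodes.

    Returns 0 for identical or synonymous names; otherwise a value that grows
    as the two name strings become less similar (assumed measure, see lead-in).
    """
    if old == new or frozenset((old, new)) in synonyms:
        return 0.0
    similarity = SequenceMatcher(None, old.lower(), new.lower()).ratio()
    return scale * (1.0 - similarity)


fac_zip_postal = fac_relabel("zip", "postal", SYNONYMS)
fac_address_addr = fac_relabel("address", "addr")
fac_address_capital = fac_relabel("address", "capital")
```

The ordering matches the example in the text: relabeling "address" to "addr" comes out cheaper than relabeling "address" to "capital", and a known synonym pair costs nothing.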

We then have, Cost(op) = (DC(op) + PDC(op)) * Fac(op).
The user of the Xtra system can customize the cost model by
tuning the DC(op), PDC(op) and Fac(op). We also provide
a default setting. Intuitively information loss is not desirable
in that old information cannot be reconstructed from the
new information. Hence the more information loss the
operation causes, the more expensive the operation is. Therefore
we rank the cost of each data capacity gap category from
lower to higher in the order of DC-Preserve, DC-Increase,
DC-Ambiguous and DC-Reduce. However, our algorithm for
discovering the transformation script does not depend on this
particular relationship.

4.2 Examples 

We illustrate how to use the cost model to choose a matching
plan from multiple candidates using the running example.
Assume we have the following settings: DC-Preserve, 0.6;
DC-Increase, 0.8; DC-Ambiguous, 0.9; DC-Reduce, 1.0. Fac(op)
for an operation op of relabeling between an element node
and a sequence list node is 3, while that for relabeling
between two synonym element nodes is 0. Also, for an
operation op that adds a subtree, we define Fac(op) to be
proportional to the number i of the subtree's leaf nodes
(i.e., Fac(op) = k * i), and we assume k = 1.

Suppose now we have two options to derive the subtree rooted
at <,>2. The first option is to match <address>1 to <,>2,
i.e., relabel <address>1 from [address] to [,] and relabel
<zip>1 from [zip] to [postal]. The second option is to add a
new subtree, which is the one finally rooted at <,>2.

For the first option, the cost of the first relabeling is
(DC(op) + PDC(op)) * Fac(op) = (0.6 + 0) * 3 = 1.8, while the
cost of the second relabeling is (DC(op) + PDC(op)) * Fac(op)
= (0.6 + 0) * 0 = 0. The total cost is then 1.8 + 0 = 1.8.
For the second option, the cost is (DC(op) + PDC(op)) *
Fac(op) = (1.0 + 0) * 4 = 4. In this case, the first option
is preferable due to its lower cost.

However, suppose that <address>1 has only a single child
node, <postal>1. The first option now requires three more
operations than the original sequence of operations, i.e.,
add <street>1, add <city>2 and add <state>1. These three
additional operations cost (0.8 + 0) * 1 = 0.8 each. The
total cost of the first option is therefore 1.8 + 0.8 * 3 =
4.2. The cost of the second option (i.e., 4) does not change
since its sequence of operations does not change. This time
the first option is not preferable, since it is more
expensive than the second option.
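
The arithmetic above can be replayed with a short sketch. It is
only an illustration under stated assumptions: the function and
variable names are hypothetical, PDC(op) is taken to be 0 as in
the example, and the category attached to each operation is
inferred from the weight used in the corresponding calculation.

```python
# DC weights from the example settings in Section 4.2.
DC = {"DC-Preserve": 0.6, "DC-Increase": 0.8, "DC-Ambiguous": 0.9, "DC-Reduce": 1.0}


def cost(dc_category, pdc, fac):
    """Cost(op) = (DC(op) + PDC(op)) * Fac(op)."""
    return (DC[dc_category] + pdc) * fac


def script_cost(ops):
    """Sum the costs of a list of (dc_category, pdc, fac) operations."""
    return sum(cost(*op) for op in ops)


# Option 1: relabel <address>1 to <,>2 (Fac = 3), then relabel the
# synonym pair zip/postal (Fac = 0).
option1 = [("DC-Preserve", 0, 3), ("DC-Preserve", 0, 0)]

# Option 2: add a new subtree with four leaf nodes (Fac = k * i = 1 * 4).
# The weight 1.0 is the one used in the text for this operation.
option2 = [("DC-Reduce", 0, 4)]

# Variant: <address>1 has a single child, so option 1 needs three
# additional single-node add operations (weight 0.8, Fac = 1 each).
option1_variant = option1 + [("DC-Increase", 0, 1)] * 3

for name, ops in [("option 1", option1),
                  ("option 2", option2),
                  ("option 1, single child", option1_variant)]:
    print(name, round(script_cost(ops), 2))
```

Up to floating-point rounding, the three totals are the values
1.8, 4 and 4.2 derived above, so the preferred option flips once
the three extra add operations are needed.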

5. GENERATION OF DTD TREE MATCHES 
5.1 Simplified Element Trees 

We constrain our investigation to the domain of E-business
documents that are exchanged between services that share
a common ontology. We can thus use name similarity as
the first heuristic indicator of a possible semantic
relationship between two tag nodes. For example, in Figures
3 and 4, the matching document root types both have a child
node named personnel, so without looking at their
descendants, we can match these two nodes. We can then derive
the matching between the descendants of the two personnel
nodes by comparing the two personnel type declaration trees
separately. However, suppose that in DTD 2 people were used
instead of personnel and no synonym knowledge were given. We
would then need to look further at the descendants of
personnel and people to decide whether to match them.

We therefore introduce the concept of a simplified element
tree. Such a tree is designed to capture the relationship
indicated by names between specific elements of the two
documents. When two DTDs are provided, we say a tag node is
non-rename-able if there exists any tag in the other DTD
whose name is the same or a synonym. In other words, the
Fac(op) cost of an operation op of renaming a
non-rename-able node to another node with a non-synonym name
is infinitely expensive. A simplified element tree of
element type E, denoted as ST, is a subtree of E's type
declaration tree T that is rooted at T's root, with each
branch ending at the first reachable non-rename-able node.
In Figures 3 and 4, the four subtrees within the dashed
lines are the simplified element trees of company, personnel,
person and name in the two DTDs, respectively.
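
As a minimal sketch of this definition, the following code cuts
a type declaration tree at the first reachable non-rename-able
node on each branch. The Node class, the simplified_tree function
and the small company tree are hypothetical illustrations; they
are not taken from Figures 3 and 4.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    children: list = field(default_factory=list)
    is_tag: bool = True  # False for constraint nodes such as "," or "*"


def simplified_tree(root, other_names, synonyms=None):
    """Copy the tree from its root, ending each branch at the first
    reachable non-rename-able tag node."""
    synonyms = synonyms or {}

    def non_renameable(node):
        # A tag node is non-rename-able if its name, or a synonym of
        # it, occurs as a tag name in the other DTD.
        if not node.is_tag:
            return False
        aliases = {node.name} | set(synonyms.get(node.name, ()))
        return bool(aliases & other_names)

    def copy(node, is_root=False):
        if not is_root and non_renameable(node):
            return Node(node.name, [], node.is_tag)  # keep node, drop descendants
        return Node(node.name, [copy(c) for c in node.children], node.is_tag)

    return copy(root, is_root=True)


# Hypothetical fragment of a type declaration tree for "company".
company = Node("company", [
    Node(",", [
        Node("cname"),
        Node("address", [Node(",", [Node("street"), Node("zip")], is_tag=False)]),
        Node("personnel", [Node("*", [Node("person")], is_tag=False)]),
    ], is_tag=False),
])

# Tag names assumed to occur in the other DTD.
other_dtd_names = {"company", "cname", "personnel", "person"}
st = simplified_tree(company, other_dtd_names)
```

In this toy input, personnel also occurs in the other DTD, so
its branch ends there and its descendants are left to the
simplified element tree of personnel. The address node has no
counterpart by name, so its descendants stay in the tree.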

5.2 The Matching Algorithm 

DMatch (DTD Match) is an XML-structure-specific tree
matching algorithm. The general unordered tree matching
problem is a notoriously high-complexity NP problem [22]. As
we mentioned in Section 1.2, the typical assumption about
relabeling in general tree matching does not hold in our
scenario, so those techniques do not apply. Our DMatch tree
matching algorithm incorporates the domain characteristics
of specific DTD tree transformation operations, the imposed
constraints and the cost model.

Given a source simplified element tree T1 and a target
simplified element tree T2, we call nodes in T1 source nodes
and nodes in T2 target nodes. If n1 and n2 are a source and
a target node respectively, the DMatch(n1, n2) algorithm
discovers a sequence of operations that transforms the
subtree rooted at n1 into the subtree rooted at n2. We call
this sequence of operations a transformation script. The
total cost of the script is then the cost of matching
(deriving) n1 and n2. Source nodes that are deleted or
removed are not mapped to any existing target node, so to
simplify the description we specify two special nodes,
namely Φ1 and Φ2. A node which is removed is said to be
mapped to Φ1, and a node which is deleted is said to be
mapped to Φ2.

DMatch is composed of two phases. The first phase is
preprocessing. As noted above, we have two special nodes,
namely Φ1 and Φ2: Φ1 is mapped to nodes that are operated on
by the remove operation, and Φ2 is mapped to nodes that are
operated on by the delete operation. The operations add,
insert, delete, remove and relabel set up a one-to-one
mapping relationship. However, the operations unfold, fold,
split and merge set up a one-to-many relationship. For
example, unfold maps one subtree to multiple subtrees, and
split maps two nodes (a star quantifier and a choice list
node) to a sequence list node. To make the matching
discovery process uniform for each node, we preprocess each
of the simplified element trees.




Figure 5: Preprocessing: Fold

In the preprocessing phase, we first convert all
representations of a sequence of non-repeatable content
particles into the format of a repeatable content particle,
i.e., perform fold operations. For example, Figure 5 (a)
will be converted to Figure 5 (b). In Figure 5 (b), we mark
the plus quantifier node with a number (i.e., two) indicating
the maximum occurrence of the content particle fax. By
marking the nodes resulting from the preprocessing, we are
able to keep track of where they are derived from, so we
do not lose the information needed for computing the overall
cost. Second, we impose a certain order on those
representations that allow arbitrary combinations of content
particles, i.e., perform merge operations.
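
A rough sketch of the fold step follows. Since Figure 5 is not
reproduced here, the input shape is an assumption: a content
model is taken to be a flat list of particle names, and the
function name fold and the triple format are hypothetical.

```python
from itertools import groupby


def fold(sequence):
    """Fold each run of an identical content particle into a single
    repeatable particle.

    Returns (name, quantifier, max_occurs) triples. A folded particle
    gets the plus quantifier and is marked with the length of the run
    it came from; other particles keep quantifier None and count 1.
    """
    folded = []
    for name, run in groupby(sequence):
        count = len(list(run))
        if count > 1:
            folded.append((name, "+", count))  # marked plus quantifier node
        else:
            folded.append((name, None, 1))
    return folded


# Hypothetical content model with two consecutive fax particles.
print(fold(["phone", "fax", "fax"]))
# [('phone', None, 1), ('fax', '+', 2)]
```

Keeping the run length next to the plus quantifier plays the
role of the mark described above: the original two-particle
sequence can still be recognized when the cost is computed.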

In the second phase, we then find one-to-one node mappings.
To derive the transformation from the subtree rooted at n1
to the subtree rooted at n2, for each child m1 of n1 we
attempt to find a matching partner m2 (a matching partner
can also be a special node Φ1 or Φ2). This matching
discovery is done in two passes. In pass 1, we visit each
child m1 of n1 sequentially and compare it with a certain
set of target nodes. We call the set of nodes that will be
compared with the current source node the matching candidate
set S. Since we have the constraint that a node cannot be
directly operated on more than once, m1's matching partner
m2 can only be on the same level as m1 (no operation or a
relabel operated on m1), one level deeper than m1 (an insert
operated on m1), or a special node (a delete or remove
operated on m1). By recursively applying DMatch to m1 and
each node s_i (0 < i < |S|) in S, we find a node s_k with
the least matching cost c. We have a control strategy to
decide whether to match m1 with s_k. In pass 1, we apply a
delay-match scheme, which delays matching s_k with m1 if
the cost is not low enough, i.e., c is not less than the
cost of deleting m1. Thus s_k can be preserved to be matched
with another node associated with a satisfactorily low cost.



Source       Matching Candidate Set
----------   --------------------------------------------------
element      element node on the same level.
attribute    attribute node on the same level.
choice       choice node on the same level.
sequence     sequence node on same level or one level deeper;
quantifier   quantifier node on same level or one level deeper;
             Φ1.

Figure 6: Matching Candidate Set in Pass 1

Source       Matching Candidate Set
----------   --------------------------------------------------
element      element node on same level or one level deeper;
             sequence node on same level;
             attribute node on same level.
attribute    element node on the same level.
choice       choice node on same level or one level deeper.
sequence     sequence node on same level or one level deeper;
             quantifier node on same level.
quantifier   quantifier node on same level or one level deeper;
             Φ1;
             sequence node on same level.

Figure 7: Matching Candidate Set in Pass 2



After visiting all children of n1, we begin pass 2 and
traverse all unmatched children of n1 again, comparing them
with possible candidates. Again, we apply DMatch to m1 and
each node s_i in S and find a node s_k with the least
matching cost c. Now a must-match scheme is applied, in
contrast to the delay-match scheme in pass 1: m1 is matched
with s_k if c is less than the cost of deleting m1 and
adding s_k. Figures 6 and 7 show the nodes that will be
selected into the matching candidate set S in pass 1 and
pass 2, respectively.



The pseudo code of the DMatch algorithm is shown in Figure
8. The full details of the algorithm, including a discussion
of optimality, time complexity, etc., can be found in [15].

DMatch(n1, n2)
{
    // pass 1
    for each child m1 of n1
    {
        Set S = m1's matching candidate set in pass 1;
        for each node s_i in S
            apply DMatch(m1, s_i);
        select a node s_k in S associated with a least cost c;
        if c < the cost of deleting m1
            match m1 and s_k;    // delay-match scheme
    }

    // pass 2
    for each unmatched child m'1 of n1
    {
        Set S' = m'1's matching candidate set in pass 2;
        for each node s'_i in S'
            apply DMatch(m'1, s'_i);
        select a node s'_k in S' associated with a least cost c';
        if c' < the cost of deleting m'1 + the cost of adding s'_k
            match m'1 and s'_k;  // must-match scheme
    }
}

Figure 8: Pseudo Code of DMatch Algorithm 
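
The two-pass control strategy of Figure 8 can be sketched in
runnable form as follows. This is a simplified illustration, not
the Xtra implementation: the names are hypothetical, the
candidate sets of Figures 6 and 7 are replaced by "all target
children that have no partner yet", and the cost model is
replaced by assumed unit costs (one per deleted or added node,
one per relabel, zero for equal labels).

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # compare nodes by identity
class Node:
    label: str
    children: list = field(default_factory=list)


def size(node):
    return 1 + sum(size(c) for c in node.children)


def delete_cost(node):  # assumed: one unit per deleted node
    return float(size(node))


def add_cost(node):  # assumed: one unit per added node
    return float(size(node))


def relabel_cost(a, b):  # assumed: free when the labels agree
    return 0.0 if a.label == b.label else 1.0


def best_candidate(m1, candidates):
    """Apply dmatch to every candidate and keep the cheapest result."""
    best = None
    for s in candidates:
        cost, matches = dmatch(m1, s)
        if best is None or cost < best[0]:
            best = (cost, matches, s)
    return best


def dmatch(n1, n2):
    """Return (cost, matches) for deriving the subtree at n2 from n1."""
    total = relabel_cost(n1, n2)
    matches = [(n1.label, n2.label)]
    free = list(n2.children)  # target children without a partner
    delayed = []

    # Pass 1, delay-match: accept the cheapest candidate only if it
    # is cheaper than deleting the source child.
    for m1 in n1.children:
        best = best_candidate(m1, free)
        if best is not None and best[0] < delete_cost(m1):
            total += best[0]
            matches += best[1]
            free.remove(best[2])
        else:
            delayed.append(m1)

    # Pass 2, must-match: accept the cheapest candidate if it is
    # cheaper than deleting the source child and adding the target.
    for m1 in delayed:
        best = best_candidate(m1, free)
        if best is not None and best[0] < delete_cost(m1) + add_cost(best[2]):
            total += best[0]
            matches += best[1]
            free.remove(best[2])
        else:
            total += delete_cost(m1)
            matches.append((m1.label, None))  # None stands for a special node

    total += sum(add_cost(s) for s in free)  # unmatched targets are added
    return total, matches


source = Node("a", [Node("b"), Node("c")])
target = Node("a", [Node("c"), Node("d")])
cost, matches = dmatch(source, target)
print(cost, matches)
```

In this toy input, pass 1 declines to relabel b to c because
that is no cheaper than deleting b, which leaves the target c
free for the source c. Pass 2 then matches b with d.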

Given matching root element types R1 and R2 of two DTDs,
we apply DMatch to the roots of R1's and R2's simplified
trees. The root match is propagated down the tree, and
matches between the name-match nodes of element types E1 and
E2 are identified. We then recursively apply the DMatch
algorithm to E1's and E2's simplified trees until no new
name-match node matches are generated.

5.3 Example Illustrating Matching Process 

We now describe how the match discovery between DTD 1 
and DTD 2 depicted in Figures 3 and 4 would be done by 
our system. We will use the same settings as shown in the 
examples in Section 4.2. 

As shown in Figures 3 and 4, there are four pairs of
simplified element trees, i.e., company, personnel, person
and name. We apply DMatch to the root type company's
simplified element trees first. We traverse <company>1's
children one by one. For <address>1, its matching candidate
set is empty since all the element nodes on the same level
(i.e., 2) are non-rename-able. For <cname>1, its matching
candidate set contains only <cname>2. Since they have the
same name, they are matched. Similarly, <personnel>1 is
matched against <personnel>2. For attribute <id>1, its
matching candidate set is empty. In pass 2, <address>1's
matching candidate set contains only <,>2. We apply DMatch
to them and derive the transformation script composed of
an operation of relabeling "address" to ",". As illustrated
in Section 4, <address>1 will be mapped to <,>2. Attribute
<license>1's matching candidate set now contains element
<license>2, and with the parameter setting, they will be
matched. Now each of <company>1's children has a partner.
Hence we are done with matching element type company.

We continue to apply DMatch to the simplified element trees
of each pair of element types matched by name, i.e.,
personnel, person and name. In this way, all matches between
them are discovered.

6. DISCUSSION

Once the relationship has been set up, an XSLT generator
will generate an XSLT script for transforming the source
XML documents into the target format. We have implemented
a working system, Xtra, and run experiments on it. The data
sets we use for experimentation include both real-world
data collected from [20] and synthetic data. The experiments
indicate that our algorithm can satisfactorily discover
acceptable transformations. Due to space limitations, we
do not discuss them further here. The details can be found
in [15].

7. CONCLUSION AND FUTURE WORK 

This work proposes an approach for automating the trans- 
formation of XML documents. Specifically, we focus on two 
fundamental problems. First, we address the problem of 
how to automate the identification of semantic relationships 
between XML-based documents. To this end, we propose 
a set of DTD transformation operations that capture com- 
mon discrepancies between alternative DTD design behav- 
iors for modeling a given entity. We also define a cost model 
for quantifying the quality of XML schema transformations. 
Second, we have developed an algorithm that performs the
actual transformation of an XML-based document from a 
given schema to a different, yet related, schema. Our work 
is unique because we incorporate domain-specific character- 
istics of the XML documents, such as domain ontology, com- 
mon transformation types, and specific DTD modeling con- 
structs (e.g., quantifiers and type-constructors). This allows 
us to avoid the high level of user interaction as well as the 
complexity required by other approaches. We have
implemented a prototype system (Xtra) and run experiments on
both real and synthetic data to verify the validity of our
approach [15].

XML-Schema [18] is emerging as a potential standard for 
describing the structure of XML documents. In the future 
we could investigate how to adapt our approach to exploit 
the richer treatment of types offered by XML Schema as 
additional hints of similarity. 